Conversation

@AndyAyersMS
Member

Experiment with turning non-escaping new (nongc)[n] into stackallocs.
Also enable new (nongc)[100] if the allocation site is within a loop, also via stackalloc.

Currently there is no restriction on how big the allocation can be (that will have to change).

@ghost ghost added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Feb 5, 2025
@dotnet-policy-service
Contributor

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

@AndyAyersMS
Member Author

AndyAyersMS commented Feb 5, 2025

@hez2010 fyi

Expect this may blow up in some tests with stack overflows; we'll see. Also I forgot to exclude sites in handlers, so things may blow up there too.

@AndyAyersMS
Member Author

AndyAyersMS commented Feb 5, 2025

Some preliminary notes. @dotnet/jit-contrib @davidwrighton @jkotas interested in your feedback.

This builds on #104906

JIT-introduced stackalloc

Currently the JIT will never introduce a stackalloc into a method, but allowing this may be interesting.

Escape Analysis

The JIT relies on escape analysis to prove that a particular allocation done by a method cannot outlive ("escape") a call to the method. A successful proof requires knowing everything that can possibly happen to that allocation. Proofs of non-escape often founder at call boundaries, since the JIT generally has no knowledge of callee behavior. For instance x will be considered escaping in M if the JIT cannot inline Q.

void M()
{
    int[] x = new int[100];
    Q(x);
}
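
As a rough illustration of the distinction (a sketch only, not taken from the PR): in the first method below the array is never stored or passed anywhere, so an escape analysis could in principle prove non-escape; in the second, the array is handed to a callee that is deliberately not inlinable and that keeps a reference. The class name, the `s_kept` field, and the body of `Q` are hypothetical. Whether a given runtime actually stack-allocates in the first case depends on the JIT version and configuration.

```csharp
using System;
using System.Runtime.CompilerServices;

static class EscapeExamples
{
    // Hypothetical sink: storing a reference here lets it outlive the call.
    static int[] s_kept = Array.Empty<int>();

    // Stands in for an opaque callee: the JIT is told not to inline it,
    // so it cannot see what happens to the argument.
    [MethodImpl(MethodImplOptions.NoInlining)]
    static void Q(int[] a) => s_kept = a;

    // The array is only read and written locally; it never leaves the method.
    public static int NonEscaping()
    {
        int[] x = new int[100];
        for (int i = 0; i < x.Length; i++)
        {
            x[i] = i;
        }
        return x[99] + x.Length; // 99 + 100
    }

    // The array is passed to Q, which here really does let it escape.
    public static int Escaping()
    {
        int[] x = new int[100];
        Q(x);
        return s_kept.Length;
    }
}
```

In the second method the conservative answer is also the correct one: `Q` really does publish the array, which is exactly what the JIT has to assume whenever it cannot see the callee.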

Even when non-escape can be proven, stack allocation may not be possible. For example in the following code snippet, the array assigned to x cannot be stack allocated since the array size is not known to the JIT, and the known-sized arrays assigned to y cannot be stack allocated since the allocation site is in a loop. In these cases the amount of stack growth required is not known at JIT time.

void N(int n)
{
    int[] x = new int[n];
    for (int i = 0; i < n; i++)
    {
        int[] y = new int[100];

        // uses of x and y
    }
}

Limitations like these can be overcome by allowing the JIT to introduce stackalloc into methods. But this comes with other complications:

  • stack space is limited, so these allocations cannot always go on the stack. Either large values of n, or even modest values of n if the stack is already close to its limit, can introduce stack overflows into methods that would not have overflowed without stack allocation. So it seems any such allocation must conditionally be on the stack or somewhere else (say an arena).
  • we generally cannot introduce stackalloc into catch or handler code (finally blocks)
  • objects with GC fields cannot (currently) be handled this way.
    • there is currently no way for a method to describe a runtime-varying number of GC roots to the runtime
    • arrays are heap objects, and writes to arrays require (unchecked) write barriers
    • there is no way to do a store covariance check without a write barrier (easy to fix)
    • many other runtime helpers assume object references are always heap references
    • stack allocated GC roots may end up being treated as so-called "untracked" lifetimes, extending the GC lifetime of the objects they reference
    • diagnostics may become more challenging

None of these seem fundamentally hard to solve, though the cost of checked write barriers might be enough to dissuade us.

If non-escape can be proven but stack allocation is conditional or not possible, the resulting object is still "thread private" and can be optimized more aggressively than if it were a general heap object. There are also widely used idioms (e.g. in enumerators) where objects clone themselves to provide thread private access. The JIT does not understand these patterns, but we could work on enhancing the memory analyses the JIT does to try to take advantage of this information.

This draft PR introduces stackalloc for non-GC type arrays when the JIT can prove non-escape. This currently has no policy attached. The initial thought for a policy is to leverage the same logic as in TryEnsureSufficientExecutionStack: at each allocation site, check the available stack capacity, and if the stack is not too full, allocate on the stack (perhaps with some additional per-allocation limit), else allocate on the heap.
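
A hand-written C# analogue of such a policy might look like the sketch below. It is only an approximation of what the JIT could emit, not the PR's implementation: `RuntimeHelpers.TryEnsureSufficientExecutionStack` is a real API, but the `MaxStackInts` threshold, the class name, and the method are hypothetical and chosen purely for illustration.

```csharp
using System;
using System.Runtime.CompilerServices;

static class ConditionalStackAlloc
{
    // Hypothetical per-allocation limit; the PR does not specify a value.
    const int MaxStackInts = 256;

    public static long SumOfSquares(int n)
    {
        if (n < 0) throw new ArgumentOutOfRangeException(nameof(n));

        // Use the stack only when the request is small AND the runtime
        // reports that enough stack remains; otherwise fall back to the heap.
        bool useStack = n <= MaxStackInts
            && RuntimeHelpers.TryEnsureSufficientExecutionStack();

        Span<int> buffer = useStack ? stackalloc int[n] : new int[n];

        for (int i = 0; i < buffer.Length; i++)
        {
            buffer[i] = i * i;
        }

        long sum = 0;
        foreach (int v in buffer)
        {
            sum += v;
        }
        return sum;
    }
}
```

The caller cannot observe which path was taken, which is the property a JIT-introduced version would also need to preserve.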

These changes may well make the allocating methods slightly slower, as the cost of the array zeroing now must be directly paid by the method, rather than incurred by GC or slow allocation helpers. But they may well make the overall system faster. In some cases the JIT may be able to prove that the zeroing is not necessary, if all elements are written before being read, but that's a ways off (if ever).

Span Captures (not part of this PR)

Escape analysis can possibly also leverage the fact that an allocation may be opaquely captured by a byref like struct. For instance in

void O()
{
    Span<int> x = new int[100];
    Q(x);
}

the array lifetime cannot exceed O's lifetime and so the array can be safely stack allocated (here "opaquely" meaning there is no way to extract the captured object from the struct). There is no need to analyze or inline Q.

As above if the array size is unknown or the allocation site is in a loop then allocation would require stackalloc and associated policies.

In general doing this sort of thing requires "field-wise" escape analysis, which is something I intend to work on, but it seems likely that just handling Span might be an easier and valuable special case; a span local has at most one GC reference inside it, so we can likely conflate the object and the reference and just leverage our current analysis.

Enabling this would potentially allow replacing some explicit stackalloc uses in the BCL with completely "safe" alternatives.
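
To make the comparison concrete, here is a small sketch (not from the PR) of the "safe" form next to the explicit stackalloc form it might eventually replace. The class name and the body of `Q` are hypothetical. Both methods hand `Q` only a `Span<int>`, so from `Q`'s point of view the backing storage is indistinguishable.

```csharp
using System;

static class SpanCapture
{
    // Hypothetical callee: it sees only the span, never the underlying array.
    static int Q(Span<int> s)
    {
        int sum = 0;
        for (int i = 0; i < s.Length; i++)
        {
            s[i] = i;
            sum += s[i];
        }
        return sum;
    }

    // "Safe" form: a heap array immediately captured by a span.
    // This is the shape the proposed analysis would try to stack allocate.
    public static int HeapBacked()
    {
        Span<int> x = new int[100];
        return Q(x);
    }

    // Explicit form written by hand today.
    public static int StackBacked()
    {
        Span<int> x = stackalloc int[100];
        return Q(x);
    }
}
```

If the optimization described above applied, the first method could be compiled much like the second without the author having to write stackalloc.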

@jkotas
Member

jkotas commented Feb 5, 2025

> The JIT relies on escape analysis to prove that a particular allocation done by a method cannot outlive ("escape") a call to the method

Do you have good examples in BCL or other real-world code where this kicks in?

> replacing some explicit stackalloc uses in the BCL with completely "safe" alternatives.

The typical uses of stackalloc in the BCL are constant-sized stackallocs or stackalloc+ArrayPool combos. Do you see the unsafe nature of the stackalloc uses in the BCL in unbounded stackalloc that may slip through code review?

BCL stackallocs and stackalloc+ArrayPool combos have other safety problems:

  • The memory is uninitialized. I doubt that we would be willing to pay for initialization of these buffers throughout the BCL.
  • The memory has to be returned to the array pool exactly once.
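
For readers unfamiliar with the pattern, a minimal sketch of a stackalloc+ArrayPool combo follows. It is illustrative only and not copied from the BCL: the `StackLimit` value, the class name, and the method are hypothetical. It shows both hazards listed above: the scratch buffer is fully overwritten before it is read, since rented arrays are not cleared, and the rented array is returned exactly once in a finally block.

```csharp
using System;
using System.Buffers;

static class PooledBuffer
{
    // Hypothetical threshold below which the scratch buffer lives on the stack.
    const int StackLimit = 128;

    public static int CountNonZero(ReadOnlySpan<byte> input)
    {
        byte[]? rented = null;

        // Small inputs use a constant-sized stack buffer; larger ones rent from the pool.
        Span<byte> scratch = input.Length <= StackLimit
            ? stackalloc byte[StackLimit]
            : (rented = ArrayPool<byte>.Shared.Rent(input.Length));

        try
        {
            // Rent may return a larger array, so trim to the size actually needed,
            // then overwrite every byte before reading any of them.
            scratch = scratch.Slice(0, input.Length);
            input.CopyTo(scratch);

            int count = 0;
            foreach (byte b in scratch)
            {
                if (b != 0) count++;
            }
            return count;
        }
        finally
        {
            // Return the rented array exactly once, and only if one was rented.
            if (rented is not null)
            {
                ArrayPool<byte>.Shared.Return(rented);
            }
        }
    }
}
```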

For the BCL use cases in particular, it may be more interesting to work on #52065 and base this optimization on top of it:

  • It would allow us to replace the stackalloc+ArrayPool combos throughout the BCL
  • JIT would inject code that returns the memory to the pool at the end of the method. Alternatively, we can work with Roslyn to introduce constructs for enforced deterministic destruction so that the cleanup code is injected in IL.
  • This array stackalloc optimization can use the same primitive.

@jkotas
Member

jkotas commented Feb 5, 2025

Span<int> x = new int[100];
Q(x);

If we had malloca-like API, I think this specific example can be converted to it as an optimization in Roslyn as well.

@hez2010
Contributor

hez2010 commented Feb 5, 2025

> Do you see the unsafe nature of the stackalloc uses in the BCL in unbounded stackalloc that may slip through code review?

Sometimes we need to return a buffer to the caller, so we cannot use stackalloc. If the method gets inlined into its caller, this analysis may let us get rid of the heap array allocation.

> Do you have good examples in BCL or other real-world code where this kicks in?

Typical scenarios where this may kick in, once we have support for gcref arrays, include string.Split and Regex.Matches.

@AndyAyersMS
Member Author

> The JIT relies on escape analysis to prove that a particular allocation done by a method cannot outlive ("escape") a call to the method
>
> Do you have good examples in BCL or other real-world code where this kicks in?

Mostly this was an exploration of how hard it would be to enable the transformation in the JIT, and to contemplate what else might need to be addressed.

I have started scouting around for potential impact but it will take a while to get a useful set of data. I also need to build up a better automated analysis for categorizing the things that block and unblock allocation (at least for the first blocker) and make sure we're not missing anything simple in our analysis.

With this PR as is, on one large internal application that has likely been extensively hand tuned, there are roughly 22K Tier1 optimized methods, 2.3K methods with array creation sites, and 4.1K total array allocation sites.

2 of the arrays are stack allocated (not sure if via localloc). I don't have a breakdown yet of what blocks the other 4K.

For some context, on this same application, conditional escape analysis for enumerators kicks in for around 200 methods.

@AndyAyersMS
Member Author

Remaining failures all look like stack overflows -- there needs to be a per-instance size limit as well as a dynamic size limit. So seems like this sort of transformation is feasible.

Adding a per-instance limit will introduce conditional heap/stack allocation, so that seems like an easy next step.

@AndyAyersMS
Member Author

SPMI failures are a timing issue with the GUID update. Still seeing (expected) stack overflows in some cases as there is no limit yet as to how much we'll put on the stack (coming soon).

@dotnet-policy-service
Contributor

Draft Pull Request was automatically closed for 30 days of inactivity. Please let us know if you'd like to reopen it.

@github-actions github-actions bot locked and limited conversation to collaborators Apr 27, 2025